ABSTRACT

The purpose of the present study was to empirically 
investigate the stability and accuracy of one suggested method for 
matching test difficulty to examinee ability level. Students' answers
to traditional classroom tests were rescored by computer as if the
examinations had been flexilevel tests. The scores thus obtained were
found to correlate highly with the traditional test scores (0.8994 to 
0.9478), thereby indicating that flexilevel test scores are 
sufficiently stable and accurate to allow their use for classroom 
evaluation purposes. (Author) 
Introduction

Conventional educational testing procedures are, to some extent,
inefficient and inaccurate because they require that a heterogeneous
group of individuals attempt every test item. Wood (1969) has
suggested that measurement might well be made more efficient and more
accurate if pupils could be routed through a test so that they spend
most of their time working on items appropriate to their ability
level. In the case of classroom evaluation, the use of a testing
procedure which matches the difficulty of the items administered to
student ability level seems desirable from at least two points of
view. First, the student will not be required to answer as
many questions as in a conventional test, thereby reducing testing
time. Second, since students will not be required to answer items
not geared to their general ability level, they will encounter fewer
failures, which seems psychologically desirable.

If a procedure for matching test difficulty to examinees on the 
basis of the examinee's ability level is to be acceptable for use as 
a classroom evaluation technique, then questions about the stability 
and accuracy of scores on such an instrument must be investigated. 
More specifically, studies should investigate the correlation between
scores from such an instrument and scores obtained from a conventional
test of the same objectives. The purpose of the present study was to
do this empirically for one suggested method of matching test
difficulty to examinee ability level.
Review of Related Research

Tests which permit this kind of measurement have been called
"branched," "computer assisted," "individualized," "programmed,"
"sequential," and, perhaps most recently, "tailored." Although
these different types of tests may differ somewhat, basically they
require all pupils to begin with the same item; however, the items
they subsequently encounter are always dependent upon their response
to the item they have just answered.

One of the earliest studies on the subject was a dissertation
by Patterson (1962). Patterson used probability models and
hypothetical populations and found that, for the models considered, the
sequential test discriminated better at the extremes than did the
conventional test.

Bayroff and Seeley (1967) administered a verbal and an arithmetic
reasoning branching test to 102 subjects. The branching was
based on item difficulties and each pupil responded to either eight
or nine items depending on the particular branch he followed. As
part of the study a conventional 50-item verbal test and a 40-item
arithmetic reasoning test were also administered. Correlations
between the conventional and branching tests ranged from .7[?] to .78.

In a study by Wood (1969) three different "tailored" mathematics
tests were prepared. The tests consisted of four, five, and
six items respectively and were administered to a sample of 91
students. The results on each of the "tailored" tests were then
correlated with mathematics grades, with the highest correlation being .35.

Linn, Rock, & Cleary (1968a; 1968b; 1969) conducted studies
which used existing item data for 4,885 eleventh grade students on
the 190 verbal-type items of the SCAT and STEP. In all, the
researchers developed seven different programmed tests. The tests differed
from each other primarily in the ways in which subjects were routed
through the test. For five of the experimental tests, two different
scoring procedures were used. Thus, a total of twelve different programmed
tests were investigated. Correlations between the programmed and
conventional tests ranged from .8738 to .9663. The experiment also
examined shortened conventional type tests and it was found that a 50 item
test would produce about the same correlation as the best of the
programmed type.

Lord (1970) reported the following requirements of "tailored"
tests.

1. Development of a large number of items for pretesting,
perhaps on the order of several thousand.

2. A very large pretesting to obtain adequate data for
statistical analysis of each item.

3. A possibly dubious but very complex statistical analysis of
pretest item data to estimate the necessary item parameters
in advance of the main testing.

4. A final pool of 500-5000 selected items, for actual
administration.

5. Computer simulations of perhaps a hundred different
tailoring strategies and scoring methods in order to select
item-administration and scoring procedures that will provide
accurate measurement at all ability levels.

6. Test administration by a computer at terminals equipped with
teletypes and visual display devices.

7. Experimental testings and statistical analyses to demonstrate
to the testing agency, to skeptical examinees, and to their
lawyers that the scoring method is fair, in the sense of
assigning approximately the same score to an examinee
regardless of which subtest of items he happens to take (pp. 1-2).

The Flexilevel Test 

It would seem that, in light of the aforementioned requirements,
the use of the "tailored" test is beyond the reach of the typical
classroom teacher and, indeed, of many standardized test developers.
To a large degree the matching of item difficulty level with ability
level can also be accomplished by what Lord (1970; 1971) calls the
flexilevel test. The flexilevel test avoids many of the
disadvantages of "tailored" tests and, thus, seems a more promising
instrument, especially for locally-constructed tests.

A conventional test may be modified to become a flexilevel
test when the items are arranged approximately in order of difficulty.
The idea of a flexilevel test is that the examinee begins with the
middle item on the test and receives immediate feedback on his
response. After each correct response he proceeds to the next harder
unanswered item. When he answers an item incorrectly he attempts
the next easier unanswered item. He continues until he has answered
n = (N + 1)/2 items, where N is the number of items on the
conventional test.
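
The routing rule just described can be sketched in a few lines of
Python. This is only an illustrative sketch, not the scoring program
used in the study: the function names, the representation of answers
as a list of booleans ordered from easiest to hardest item, and the
restriction to an odd number of items (so that a single middle item
exists) are all assumptions made here. The description above does not
state a scoring rule, so the sketch simply counts the correct answers
among the items attempted.

```python
def flexilevel_path(responses):
    """Return the item indices a flexilevel examinee would attempt.

    `responses` holds one boolean per item (True = answered correctly),
    with items ordered from easiest to hardest. Assumption: the number
    of items N is odd, so there is a single middle item.
    """
    total_items = len(responses)
    if total_items % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")

    items_to_answer = (total_items + 1) // 2  # n = (N + 1) / 2
    middle = total_items // 2
    low = high = current = middle  # answered items always span low..high
    administered = []

    for step in range(items_to_answer):
        administered.append(current)
        if step == items_to_answer - 1:
            break  # quota reached; no further routing needed
        if responses[current]:
            high += 1       # correct: next harder unanswered item
            current = high
        else:
            low -= 1        # incorrect: next easier unanswered item
            current = low
    return administered


def number_correct(responses, administered):
    """Count correct answers among the administered items (assumed scoring)."""
    return sum(1 for item in administered if responses[item])


# Made-up answers to a five-item test, easiest item first.
answers = [True, True, True, False, False]
path = flexilevel_path(answers)
print(path, number_correct(answers, path))
```

For this made-up five-item example the examinee attempts items 2, 3,
and 1 (counting from zero) and answers two of them correctly. Because
the examinee starts at the middle item and answers only (N + 1)/2
items, the routing can never run past the easiest or the hardest item.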

A theoretical study of the measurement properties of the
flexilevel test (Lord, 1971) showed that:

Near the middle of the ability range for which the test is
designed a flexilevel test is less effective than is a comparable
peaked conventional test. In the outlying half of the ability
range, the flexilevel test provides more accurate measurement in
typical aptitude and achievement testing situations than a peaked
conventional test composed of comparable items (p. 813).

Procedure 

In order to empirically investigate the stability and accuracy
of flexilevel tests, the present study utilized data from five
previously administered conventional objective examinations. Three of
the tests considered were classroom examinations administered during
a one semester junior level college course in introductory educational
measurements. The tests contained 42, 36, and 36 items respectively.
The remaining two examinations investigated were semester final
examinations in a high school geometry course. These examinations
consisted of 100 and 70 items respectively.



For each of the tests, each student's answers were rescored by
means of a computer scoring program as if the test had been a
flexilevel test. This test is referred to as a simulated flexilevel test.
To investigate the relationship between the students' scores on the
simulated flexilevel test and the traditional test, the Pearson
product-moment coefficient of correlation was then calculated.
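
The correlation step can be illustrated with a short Python sketch of
the Pearson product-moment coefficient. The function name `pearson_r`
and the two score lists are hypothetical and made up for illustration;
they are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for five students (not data from the study).
traditional_scores = [30, 25, 34, 18, 28]
simulated_flexilevel_scores = [15, 12, 18, 8, 14]
print(round(pearson_r(traditional_scores, simulated_flexilevel_scores), 4))
```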

In addition to obtaining the correlation between each simulated
flexilevel examination and the corresponding traditional test, the
following additional evidence was obtained for the introductory
educational measurements examinations. Two total scores were obtained for
each subject in the course. The first of these total scores was
obtained by summing the subject's standard scores on the traditional
exams and the second by summing the standard scores on the simulated
flexilevel exams. The two sets of total scores were then correlated.
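
The totalling step can be sketched as follows. The kind of standard
score used is not specified here, so this sketch assumes ordinary
z-scores computed with the population standard deviation; the function
names and the example scores are hypothetical.

```python
import statistics


def standard_scores(raw_scores):
    """Convert one exam's raw scores to z-scores (assumed standard score)."""
    mean = statistics.fmean(raw_scores)
    sd = statistics.pstdev(raw_scores)  # population SD; an assumption
    return [(score - mean) / sd for score in raw_scores]


def total_standard_scores(exams):
    """Sum each student's standard scores across several exams.

    `exams` is a list of score lists, one list per exam, with students
    in the same order in every list.
    """
    per_exam = [standard_scores(exam) for exam in exams]
    return [sum(student_scores) for student_scores in zip(*per_exam)]


# Hypothetical raw scores for three students on two exams.
totals = total_standard_scores([[1, 2, 3], [10, 20, 30]])
print(totals)
```

Applying the same function once to the traditional scores and once to
the simulated flexilevel scores yields the two sets of totals that are
then correlated.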
Data Source 

The data for the three educational measurements examinations
used in the study were obtained from approximately 180 students
enrolled in an introductory educational measurements course at the
University of Kansas during the Spring semester of 1971. The data on
the first geometry examination were obtained from 412 students
enrolled in a geometry course at Shawnee Mission South High School,
Shawnee Mission, Kansas. The data on the second geometry exam were
obtained from students enrolled in a geometry course at the same
high school. The first geometry exam was administered at the end of
the Spring 1971 semester. The second was administered at the end of
the Fall 1971 semester.



Results 

For the three introductory educational measurements examinations
considered in the present study, the correlations between the original
examinations and the corresponding simulated flexilevel test were
0.8994, 0.9464, and 0.9090 respectively. For the two high school
geometry exams investigated, the correlations were 0.9353 and 0.9478.
Interpreting the five obtained correlations as measures of parallel
forms reliability (measures of equivalence and stability) indicates
that flexilevel test scores could validly be substituted for scores
obtained by administering a traditional test.

In the introductory educational measurements course the
correlation between total scores based on traditional tests and total
scores based on simulated flexilevel test scores was 0.955. The
magnitude of this correlation further indicates that flexilevel test
scores possess the stability and equivalence characteristics
necessary for them to be substituted for scores on a traditional
classroom examination.

Summary and Implications

The results of the present study demonstrate that scores
obtained from simulated flexilevel tests can validly be substituted for
scores on traditional tests of the same objectives. If further
investigations using actual flexilevel tests in the classroom show the same
high degree of relationship between traditional and flexilevel test
scores, then teachers will have an easy-to-use method of making
testing more efficient.